Miraç Lütfullah Gülgönül


Analyzing Gene Functions in Arabidopsis Thaliana

03 February 2024

The Arabidopsis Information Resource (TAIR) maintains a database of genetic and molecular biology data for the model higher plant Arabidopsis thaliana. We can do all sorts of analyses with this data, but in this post we will focus on looking at gene functions. You can download the annotation data (ATH_GO_GOSLIM.txt.gz) from the TAIR site, but I also included it in my Github repo mlg556/arabidopsis, so you can just clone that.

First let’s import the libraries. We will use pandas for reading/analyzing the data, and zipfile to extract the data file, which is archived. You can simply install pandas via pip install pandas, and zipfile is a built-in python library.

import pandas as pd
import zipfile

We extract the archived file ATH_GO_GOSLIM.txt.gz to ATH_GO_GOSLIM.txt. Note that although the extension .gz suggests Gzip/7zip file, we need a regular zip extractor.

# extract txt, file extension is erroneously gz, but its actually a zip file
fname = "ATH_GO_GOSLIM.txt"
fname_gz = f"{fname}.gz"

with zipfile.ZipFile(fname_gz, 'r') as z:
    z.extractall(".")

The resulting file is txt file using \t as a delimiter. The column header names is absent, but we can read that from the companion file ATH_GO.README.txt, which I also included in the repo. To read the data into a pandas dataframe, I used read_csv. We specify the separator with sep="\t", and the column names. We need skip the first 4 rows because they are comments:

!Project_name: The Arabidopsis Information Resource (TAIR)
!URL: http://www.arabidopsis.org
!Contact Email: curator@arabidopsis.org
!Last Updated: 2023-12-01

Finally we print the dataframe see what it looks like

# column data from from ATH_GO.README.txt
columns = [
    "locus_name",
    "tair_acc",
    "obj_name",
    "rel_type",
    "go_term",
    "go_id",
    "tair_id",
    "aspect",
    "go_slim",
    "evidence_code",
    "evidence_desc",
    "evidence_with",
    "reference",
    "annotator",
    "date"
]


# read into dataframe
df0 = pd.read_csv(fname, sep="\t", names=columns, skiprows=[0,1,2,3])
df0
locus_name tair_acc obj_name rel_type go_term go_id tair_id aspect go_slim evidence_code evidence_desc evidence_with reference annotator date
0 AT1G01010 locus:2200935 AT1G01010 acts upstream of or within defense response to other organism GO:0098542 46569 P response to stress IEA traceable computational prediction AGI_LocusCode:AT2G43510|AGI_LocusCode:AT4G1473… Publication:501796011|PMID:34562334 klaasvdp 2022-11-14
1 AT1G01010 locus:2200935 AT1G01010 acts upstream of or within response to oxidative stress GO:0006979 6625 P response to stress IEA traceable computational prediction AGI_LocusCode:AT5G19875 Publication:501796011|PMID:34562334 klaasvdp 2022-11-14
2 AT1G01010 locus:2200935 AT1G01010 acts upstream of or within response to abscisic acid GO:0009737 11395 P response to chemical IEA traceable computational prediction AGI_LocusCode:AT4G27410 Publication:501796011|PMID:34562334 klaasvdp 2022-11-14
3 AT1G01010 locus:2200935 AT1G01010 acts upstream of or within response to lipid GO:0033993 28865 P response to chemical IEA traceable computational prediction AGI_LocusCode:AT4G27410|AGI_LocusCode:AT2G0299… Publication:501796011|PMID:34562334 klaasvdp 2022-11-14
4 AT1G01010 locus:2200935 AT1G01010 acts upstream of or within oxoacid metabolic process GO:0043436 21524 P other cellular processes IEA traceable computational prediction AGI_LocusCode:AT5G63790 Publication:501796011|PMID:34562334 klaasvdp 2022-11-14
456201 YAK gene:1945468 YAK is active in cellular_component GO:0005575 163 C unknown cellular components ND ‘Unknown’ cellular component NONE Communication:1345790 TAIR 2022-02-01
456202 YAK gene:1945468 YAK enables molecular_function GO:0003674 3226 F unknown molecular functions ND ‘Unknown’ molecular function NaN Communication:1345790 TAIR 2006-10-20
456203 YI gene:1945470 YI involved in biological_process GO:0008150 5239 P unknown biological processes ND ‘Unknown’ biological process NONE Communication:1345790 TAIR 2022-02-01
456204 YI gene:1945470 YI is active in cellular_component GO:0005575 163 C unknown cellular components ND ‘Unknown’ cellular component NONE Communication:1345790 TAIR 2022-02-01
456205 YI gene:1945470 YI enables molecular_function GO:0003674 3226 F unknown molecular functions ND ‘Unknown’ molecular function NaN Communication:1345790 TAIR 2006-10-20

456206 rows × 15 columns

As you can see, this is a quite large dataframe with 456\,206 genes and their associated functions. The go_term column is a nice description of what the gene does, so we will use that. I have also noticed that there are 3 GO terms we need to exclude, because they are very generic and not useful: ["molecular_function", "biological_process", "cellular_component"].

We select the columns by indexing into the dataframe, drop duplicates, and finally drop the rows which contain the excluded go_terms with query(). We could also use indexing for this, but query looks nicer I think.

select_columns = ["locus_name", "go_term"]

excluded_go_terms = ["molecular_function", "biological_process", "cellular_component"]

df = df0[select_columns] # select gene name and function
df = df.drop_duplicates() # drop duplicates
df = df.query("go_term not in @excluded_go_terms") # exclude go_terms

df
locus_name go_term
0 AT1G01010 defense response to other organism
1 AT1G01010 response to oxidative stress
2 AT1G01010 response to abscisic acid
3 AT1G01010 response to lipid
4 AT1G01010 oxoacid metabolic process
456175 XRS9 DNA damage response
456176 XRS9 DNA repair
456182 XRS9 response to X-ray
456183 XTC1 embryo development ending in seed dormancy
456190 XTC2 embryo development ending in seed dormancy

175917 rows × 2 columns

We are left with a table 175\,917 of genes with their corresponding functions, but I want to analyze them by function. Let’s see which functions are done by which genes. We do this by using groupby:

dfg = df.groupby("go_term", as_index=False).apply(lambda x: x)
dfg
locus_name go_term
0 59240 AT1G36280 ‘de novo’ AMP biosynthetic process
264420 AT3G57610 ‘de novo’ AMP biosynthetic process
301471 AT4G18440 ‘de novo’ AMP biosynthetic process
1 52413 AT1G30820 ‘de novo’ CTP biosynthetic process
161072 AT2G34890 ‘de novo’ CTP biosynthetic process
7460 97088 AT1G69770 zygote asymmetric cytokinesis in embryo sac
118068 AT2G01210 zygote asymmetric cytokinesis in embryo sac
412150 AT5G49160 zygote asymmetric cytokinesis in embryo sac
7461 129874 AT2G17090 zygote elongation
157944 AT2G33160 zygote elongation

175917 rows × 2 columns

We see that, for example ‘de novo’ AMP biosynthetic process is regulated by three different genes: AT1G36280, AT3G57610 and AT4G18440. We can see which functions are regulated by the most number of genes using grouby combined with count, and sort them in descending order:

dfg_sum = df.groupby("go_term").count()
dfg_sum = dfg_sum.sort_values(by=['locus_name'], ascending=False)
dfg_sum
locus_name
go_term
nucleus 10494
chloroplast 4966
cytoplasm 4771
protein binding 4519
mitochondrion 4449
organelle fission 1
beta,beta digalactosyldiacylglycerol galactosyltransferase activity 1
regulation of MAPK cascade 1
regulation of GTPase activity 1
regulation of trichome patterning 1

7462 rows × 1 columns

This is nice, but not very helpful. Go terms like nucleus and protein binding seem to me too broad to be meaningful. Of course at the other end we have very specific functions regulated by a single gene. I would like to see more of what’s in between, so let’s look at the distribution:

import matplotlib.pyplot as plt
distribution = dfg_sum.iloc[:50].plot.bar(figsize=(20, 5))
Gene function distribution.

Apart from the usual suspects, the more interesting functions to me are “response to water deprivation”, “response to light stimulus” and “response to abscisic acid”. We see that the type of things which are more important to a plant have more genes to express/regulate them. Abscisic acid is a plant hormone that regulates numerous aspects of plant growth, development, and stress responses.

Another thing worthy of notice is that the distribution seems to follow a power law and the Pareto principle. I don’t know if this is to be expected or not.

plot = dfg_sum.iloc[:100].plot(figsize=(20, 5), marker=".", grid=True)
plot.plot([10_000/x for x in range(1,100)])
_ = plot.legend(["locus_name", "1/x"])
Power law.

Citations